ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
Authors
Abstract
Image-Text matching (ITM) is a common task for evaluating the quality of Vision and Language (VL) models. However, existing ITM benchmarks have a significant limitation: they contain many missing correspondences, originating from the data construction process itself. For example, a caption is matched with only one image, although it could equally describe other similar images, and vice versa. To correct these massive false negatives, we construct the Extended COCO Validation (ECCV) Caption dataset by supplying the missing associations with machine and human annotators. We employ five state-of-the-art ITM models with diverse properties for our annotation process. Our dataset provides ×3.6 positive image-to-caption associations and ×8.5 caption-to-image associations compared to the original MS-COCO. We also propose to use an informative ranking-based metric, mAP@R, rather than the popular Recall@K (R@K). We re-evaluate 25 VL models on existing and proposed benchmarks. Our findings are that the existing benchmarks, such as COCO 1K R@K, COCO 5K R@K, and CxC R@1, are highly correlated with each other, while the rankings change when we shift to ECCV mAP@R. Lastly, we delve into the effect of the bias introduced by the choice of annotator. Source code is available at https://github.com/naver-ai/eccv-caption
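The mAP@R metric proposed above credits precision only at ranks where a ground-truth positive is retrieved, within the top-R ranks (R = number of positives for the query). A minimal sketch of the per-query computation, assuming a boolean relevance vector over the ranked retrieval list (the function name and array convention are illustrative, not taken from the paper's codebase):

```python
import numpy as np

def map_at_r(is_relevant: np.ndarray) -> float:
    """mAP@R for a single query.

    is_relevant: boolean array over the ranked retrieval list,
    True where the retrieved item is a ground-truth positive.
    Only the top-R ranks contribute, and precision-at-i is
    credited only at ranks where a positive appears.
    """
    r = int(is_relevant.sum())  # R = total positives for this query
    if r == 0:
        return 0.0
    top_r = is_relevant[:r]
    # precision at each rank i within the top-R window
    precisions = np.cumsum(top_r) / (np.arange(r) + 1)
    # average only the precisions at relevant ranks, divided by R
    return float((precisions * top_r).sum() / r)
```

A perfect retrieval (all R positives ranked first) scores 1.0, while pushing any positive outside the top-R window lowers the score, unlike R@K, which ignores ranking quality beyond the presence of one hit.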
Similar resources
Cross-Lingual Image Caption Generation
Automatically generating a natural language description of an image is a fundamental problem in artificial intelligence. This task involves both computer vision and natural language processing and is called “image caption generation.” Research on image caption generation has typically focused on taking in an image and generating a caption in English as existing image caption corpora are mostly ...
Topic-Specific Image Caption Generation
Recently, image captioning, which aims to generate a textual description for an image automatically, has attracted researchers from various fields. Encouraging performance has been achieved by applying deep neural networks. Most of these works aim at generating a single caption, which may be incomprehensive, especially for complex images. This paper proposes a topic-specific multi-caption generator, ...
Multimodal Pivots for Image Caption Translation
We present an approach to improve statistical machine translation of image descriptions by multimodal pivots defined in visual space. Image similarity is computed by a convolutional neural network and incorporated into a target-side translation memory retrieval model where descriptions of most similar images are used to rerank translation outputs. Our approach does not depend on the availabilit...
Image Caption Generation with Recursive Neural Networks
The ability to recognize image features and generate accurate, syntactically reasonable text descriptions is important for many tasks in computer vision. Auto-captioning could, for example, be used to provide descriptions of website content, or to generate frame-by-frame descriptions of video for the vision-impaired. In this project, a multimodal architecture for generating image captions is ex...
Deep image representations using caption generators
Deep learning exploits large volumes of labeled data to learn powerful models. When the target dataset is small, it is a common practice to perform transfer learning using pre-trained models to learn new task-specific representations. However, pre-trained CNNs for image recognition are provided with limited information about the image during training, which is the label alone. Tasks such as scene r...
Journal
Journal title: Lecture Notes in Computer Science
سال: 2022
ISSN: ['1611-3349', '0302-9743']
DOI: https://doi.org/10.1007/978-3-031-20074-8_1